Blang workshop
Alexandre Bouchard-Côté
11/13/2019
Logistics
- Finding these slides: https://github.com/alexandrebouchard/blangWorkshop/slides.html
- Code examples:
- Via Silico, a web scientific modelling tool supporting the Blang PPL
- You should receive an email with link to complimentary Silico plan
- Once logged in to Silico, follow this link (when asked for a student number, enter arbitrary digits): TODO
- If you find Silico/Blang useful, please spread the word! Consider using in your research/classroom/consulting.
Other ways to run Blang (I can help with setup after the workshop):
- Command line
- Desktop IDE
- R and python interfaces
- Programmatically
Plan: several examples of Bayesian data science using Blang
- Modelling rocket failure probabilities (basic Blang model concepts)
- Human pedigrees (composing models)
- Hierarchical modelling (working with tidy data)
- CRISPR-Cas9 (distributions as parameters)
- Permutation inference (custom data types)
Please interrupt me early and often with questions!
Bayesian analysis: pros and cons
- Addresses most data analysis issues (missing data, non-standard data types, non-iid, weird loss functions, adding expert knowledge, …)
- Bayesian analysis: addresses those in a (semi) automated fashion / principled framework (“reductionist”)
- Reductionism can be bad or good (main con of reductionism is computational)
- Frequentist statistics: every problem is a new problem
- Implementation complexity
- Efficient in analyst’s time (thanks to PPLs)
- Harder to scale computationally
- \(\Longrightarrow\) shines on “precious data” problems
- Statistical properties
- Optimal if the model is well-specified
- Sub-optimal in certain cases when the model is mis-specified
- Thankfully the modelling flexibility makes it easier to build better models
- Important to make model checks
Bayesian statistics in the 20th century
If you learned Bayesian analysis from a typical stats curriculum, you may have been left with a bad impression of Bayesian statistics
- Textbooks emphasize “regular” examples where both frequentist and Bayesian methods readily apply
- Reason: students need to work out computations by hand
- Simple statistical problems are precisely the type of situations where the Bayesian choice is less attractive in my opinion
Conundrum
- Real data have many quirks (complex missingness, non-standard data types, non-iid, weird loss functions, adding expert knowledge, …)
- Building quirky Bayesian models for quirky data is relatively easy
- But…
- Computing the posterior distribution of quirky models is often impossible to do analytically
- Coding MCMC or variational inference by hand is error-prone and time consuming
Bayesian statistics in the 21st century: PPL to the rescue!
Probabilistic Programming Languages (PPL): automating \[\text{Probability model + Data} \xrightarrow{\text{PPL}} \text{Posterior distribution} \]
Why this is revolutionary: potential to build quirky Bayesian methods adapted to the inferential problem (instead of adapting the problem to the method as often done nowadays)
Danger: increasing focus on PPLs such as Stan specializing on “regular data” (e.g. assuming a fixed set of built-in types for latent variables, real and/or integer)
- Scalable posterior inference is easier to automate if all latent variables are in \(\mathbb{R}^d\) (thanks to Hamiltonian Monte Carlo (HMC))
- Important to have tools that are computationally fast
- However this should not come at the cost of not being able to deal with quirky data
Blang: a PPL for quirky data
Enabling technology: methods based on distribution continua (non-reversible parallel tempering and sequential change of measure)
Introductory example: rocket launches
- Would you rather get strapped to…
- “shiny rocket”: 1 success, 0 failures
- “rugged rocket”: 98 successes, 2 failures

Paradox?
- Maximum likelihood point estimates:
- “shiny rocket”: 100% success rate (1 success, 0 failures)
- “rugged rocket”: 98% success rate (98 successes, 2 failures)
- What is missing?
Uncertainty estimates
- Take-home message:
- Point estimates are often insufficient, and can be very dangerous
- We want some measure of uncertainty
- Bayesian inference provides one way to build uncertainty measures
- Bayesian measures of uncertainty: credible intervals
- Alternatives exist:
- Confidence intervals, from frequentist statistics
- Key difference: credible regions can be automatically approximated for any data type; not so for confidence intervals
Bayesian inference: high-level picture
- Construct a probability model including
- random variables for what we will measure/observe
- random variables for the unknown quantities
- those we are interested in (“parameters”, “predictions”)
- others that just help us formulate the problem (“nuisance”, “random effects”).
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to:
- make prediction (point estimate)
- estimate uncertainty (credible intervals)
- choose an action (decision theoretic framework)
Towards a model: biased coins

- Standard coin: when you flip,
- get a heads (H; encoded as 1) with probability 0.5
- get a tails (T; encoded as 0) with probability 0.5
- “Biased” coin: when you flip,
- get a H with probability \(p \in [0, 1]\)
- get a T with probability \(1 - p\)
- Synonym: flips are Bernoulli random variables
- For us:
- 1 \(\leftrightarrow\) successful launch
- 0 \(\leftrightarrow\) failed launch
- Problem: we do not know \(p\)!
Making \(p\) random

- Imagine a bag with \(K+1\) biased coins
- Coin number \(i\) in \(\{0, 1, 2, ..., K\}\) has bias \(p_i = i/K\)
- Example: \(K+1 = 3\) coins
- First coin: \(p_0 = 0/2 = 0\)
- Second coin: \(p_1 = 1/2\)
- Third coin: \(p_2 = 2/2 = 1\)
Simple probability model for the rocket launch problem

- Step 1: Pick one of the \(K+1\) coins from the bucket
- \(\leftrightarrow\) Design a new rocket
- Step 2: Repeatedly flip the same coin
- \(\leftrightarrow\) Build a bunch of copies of the rocket, launch them and enjoy the show
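The two-step generative process above can be sketched in simulation; a minimal Python sketch (an illustration, not the Blang model — the function name and seed are arbitrary):

```python
import random

def simulate_launches(K, n, rng=random.Random(1)):
    """Step 1: pick coin i uniformly from {0, ..., K} (design a rocket);
    step 2: flip that same coin n times (launch n copies)."""
    i = rng.randint(0, K)                                  # coin index, bias i/K
    flips = [1 if rng.random() < i / K else 0 for _ in range(n)]
    return i, flips

coin, flips = simulate_launches(K=2, n=3)
```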
Warm-up calculation
- Model
- Step 1: Pick one of the \(K+1 = 3\) coins from the bucket
- Step 2: Repeatedly flip the same coin

Warm-up calculation: If the first \(n=3\) flips all return heads, what is the posterior probability that the standard (\(p = 1/2\)) coin was picked?
Warm-up calculation, illustrated

- Decision tree: a recursive classification of all possible scenarios
- Nodes in the tree are “groups of scenarios”, i.e. an event
- Each column is a criterion for decomposing scenarios, i.e. a random variable
- Children of a node break an event into an exhaustive list of subcases
- Each edge \(A \to B\) is labelled with \({\mathbb{P}}(B|A)\)
- Bayes’ rule:
- Attack the slightly more general problem \(\pi(i) = {\mathbb{P}}(C_i | H_1, H_2, H_3)\)
- By definition of conditioning \[\pi(i) = \frac{{\mathbb{P}}(C_i, H_1, H_2, H_3)}{{\mathbb{P}}(H_1, H_2, H_3)}\] let us call the numerator \(\gamma(i)\) and the denominator, \(Z\), i.e. \(\pi(i) = \frac{\gamma(i)}{Z}\)
- Compute \(\gamma(0), \gamma(1), \gamma(2)\) \[\begin{aligned}
\gamma(0) &= 0 \\
\gamma(1) &= {\mathbb{P}}(C_1, H_1, H_2, H_3) = {\mathbb{P}}(C_1) {\mathbb{P}}(H_1 | C_1) {\mathbb{P}}(H_2 | C_1, H_1) {\mathbb{P}}(H_3 | C_1, H_1, H_2) = (1/3) (1/2) (1/2) (1/2) = 1/24 \\
\gamma(2) &= {\mathbb{P}}(C_2, H_1, H_2, H_3) = {\mathbb{P}}(C_2) {\mathbb{P}}(H_1 | C_2) {\mathbb{P}}(H_2 | C_2, H_1) {\mathbb{P}}(H_3 | C_2, H_1, H_2) = 1/3
\end{aligned}\]
- The vector \(\gamma\) is not a probability whereas \(\pi\) should be. Normalize it. \[
\pi(i) = \frac{\gamma(i)}{\sum_{i=0}^K \gamma(i)}
\]
So the answer is: \({\mathbb{P}}(C_1 | H_1, H_2, H_3) = \frac{1/24}{1/24 + 1/3} = 1/9 \approx 0.11\).
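The normalize-the-numerator recipe above can be checked numerically; a short Python sketch of the same calculation:

```python
K = 2  # coins 0, 1, 2 with biases i / K
n = 3  # observed: n heads in a row

# unnormalized posterior: gamma(i) = P(C_i) * P(H_1, ..., H_n | C_i)
gamma = [(1 / (K + 1)) * (i / K) ** n for i in range(K + 1)]
Z = sum(gamma)                   # normalization constant
pi = [g / Z for g in gamma]      # pi(1) = 1/9, pi(2) = 8/9
```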
PPL implementation
Accessing the first example: click on Q1
model Coins {
  random IntVar index ?: latentInt
  random List<IntVar> flips ?: fixedIntList(1, 1, 1)
  laws {
    index ~ DiscreteUniform(0, 3)
    for (IntVar flip : flips) {
      flip | index ~ Bernoulli(index / 2.0)
    }
  }
}
Specifying a decision tree (probability model): the laws block
Statistical notation:
\[\begin{align}
I &\sim \text{Uniform}\{0, 1, 2, \dots, (K-1), K\} \\
X_n | I &\sim \text{Bernoulli}(I/K); n \in \{1, 2, 3\}
\end{align}\]

- Interpretation: \(I\) is the index of the coin we pick; \(X_n\) is the \(n\)-th flip
- The notation \(X_n | I \sim \text{Bernoulli}(I/K)\) means:
- The conditional PMF of the random variable \(X_n\) given \(I\) is \(\text{Bernoulli}(I/K)\)
- I.e., for any \(x \in {\mathbb{R}}\) and \(i \in \{0, 1, \dots, K\}\), the following conditional probability can be read off the picture on the right
- Or, mathematically
\[\begin{align}{\mathbb{P}}(X_n = x | I = i) =
\begin{cases}
i/K, & \text{if } x = 1 \\
1 - i/K, & \text{if } x = 0 \\
0, & \text{otherwise}
\end{cases}
\end{align}\]
Blang notation:
index ~ DiscreteUniform(0, 3)
for (IntVar flip : flips) {
flip | index ~ Bernoulli(index / 2.0)
}
- Extremely similar to the mathematical notation… this is by design! (and inspired by BUGS)
- Note:
package blang.distributions

/** Uniform random variable over the contiguous set of integers \(\{m, m+1, \dots, M-1\}\). */
@Samplers(UniformSampler)
model DiscreteUniform {
  random IntVar realization
  /** The left point of the set (inclusive). \(m \in (-\infty, M)\) */
  param IntVar minInclusive
  /** The right point of the set (exclusive). \(M \in (m, \infty)\) */
  param IntVar maxExclusive
  laws {
    logf(minInclusive, maxExclusive) {
      if (maxExclusive - minInclusive <= 0.0) return NEGATIVE_INFINITY
      return -log(maxExclusive - minInclusive)
    }
    logf(realization, minInclusive, maxExclusive) {
      if (minInclusive <= realization &&
          realization < maxExclusive) return 0.0
      else return NEGATIVE_INFINITY
    }
  }
  generate(rand) {
    rand.discreteUniform(minInclusive, maxExclusive)
  }
}
Creating a model: DONE!
- DONE! Construct a probability model
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to make prediction/estimate uncertainty
Setting up the posterior computation in Blang
- Which posterior? What is observed?
- 2 ways:
- specify in the Blang code (this example)
- command line options (will see in hierarchical model example, can also be used to override method 1)
random IntVar index ?: latentInt
random List<IntVar> flips ?: fixedIntList(1, 1, 1)
random means the laws block defines a (conditional) distribution on it
IntVar and List<IntVar>: the data type of realizations of this random variable
latentInt marks the random variable as not observed (sampled)
fixedIntList(1, 1, 1) marks the random variable as observed: a list of three “1”’s (H flips)
latentInt and fixedIntList are two different types of initialization; see https://www.stat.ubc.ca/~bouchard/blang/Random_variables.html for more
Computing the posterior in Blang
- In Silico, just click Run!
- Command-line:
blang --model Coins --postProcessor DefaultPostProcessor
- This computes \[{\mathbb{P}}(\text{all unobserved random variables}|\text{all observed ones})\]
- Each run creates a set of results in
results/latest
- For example: posterior probability mass function (PMF): open
results/latest/posteriorPlots/index.pdf (in Silico, lower bottom panel)
Posterior PMF

Computing the posterior: DONE!
- DONE! Construct a probability model including
- DONE! Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to make prediction/estimate uncertainty
How to use the posterior distributions
Samples are output in tidy format: for example samples for
random Matrix m ?: latentMatrix(2,2)
will produce a file results/latest/samples/m.csv that looks like this:
sample,row,col,value
0,0,0,-0.12892042622729627
0,0,1,-1.576368198773273
0,1,0,0.4712781753321984
0,1,1,1.104505677917305
1,0,0,1.1399400671309814
1,0,1,-0.7598126921692785
1,1,0,0.5586817875162722
1,1,1,1.0949660195131257
...
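One way to consume such tidy output is to aggregate over the `sample` index; a minimal Python sketch using only the standard library, with the rows above inlined for illustration (in practice you would read results/latest/samples/m.csv):

```python
import csv
import io
from collections import defaultdict

# the first two samples of the tidy output, inlined for illustration
tidy = """sample,row,col,value
0,0,0,-0.12892042622729627
0,0,1,-1.576368198773273
0,1,0,0.4712781753321984
0,1,1,1.104505677917305
1,0,0,1.1399400671309814
1,0,1,-0.7598126921692785
1,1,0,0.5586817875162722
1,1,1,1.0949660195131257
"""

# posterior mean of each matrix entry: average over the "sample" index
sums, counts = defaultdict(float), defaultdict(int)
for rec in csv.DictReader(io.StringIO(tidy)):
    key = (int(rec["row"]), int(rec["col"]))
    sums[key] += float(rec["value"])
    counts[key] += 1
posterior_mean = {k: sums[k] / counts[k] for k in sums}
```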
Useful things that can be automatically created at the end of a run
Click on the file configuration.txt (left panel): you will find
--postProcessor DefaultPostProcessor
Postprocessors are responsible for summarizing the posterior distributions. They can be customized, but the default one will create:
- Summary statistics (posterior means, median, SD, etc): in directory
results/latest/summaries
- Trace plots:
results/latest/tracePlots
- Effective sample size:
results/latest/ess
- Posterior density/mass functions:
results/latest/posteriorPlots
- Diagnostics plots (tempering adaptation):
results/latest/monitoringPlots
- Reproducibility information:
results/latest/arguments.tsv and results/latest/executionInfo
https://silico.io/abouchard/assignment/template/blang-workshop-nov-2019/q1/coins/latest
Use the posterior: DONE!
- DONE! Construct a probability model including
- DONE! Compute the posterior distribution conditionally on the actual data at hand
- DONE! Use the posterior distribution to make prediction/estimate uncertainty
Pedigree example: composing models and inference over combinatorial spaces

- First building block: Mendelian inheritance
- Suppose at a certain position in the genome there are two variants in the population (alleles): “A”(major) and “a” (minor)
- Since our first 22 chromosomes come in nearly identical pairs this means there are 3 possible genotypes: AA (both chromosomes with the major allele), aa (both with the minor one), Aa (one of each).
- The distribution of a child’s genotype is obtained as follows:
- Look at the two chromosomes from the mother. Flip a coin to choose one. Copy the value.
- Look at the two chromosomes from the father. Flip an independent coin to choose one. Copy the value.
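The two copying steps above can be sketched in a few lines of Python (a toy illustration, not the Blang Child.bl model; the two-character genotype encoding is an assumption):

```python
import random

def child_genotype(mother, father, rng=random.Random(0)):
    """Mendelian inheritance: copy one uniformly chosen allele from each parent.
    Genotypes are two-character strings over the alleles 'A' (major) and 'a' (minor)."""
    return rng.choice(mother) + rng.choice(father)

# e.g. an 'Aa' x 'Aa' cross copies two minor alleles with probability 1/4
genotype = child_genotype("Aa", "Aa")
```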

Blang code for Mendelian inheritance (Q3/Child.bl): https://silico.io/abouchard/assignment/template/blang-workshop-nov-2019/q3/pedigrees/latest
- Second building block: recessive trait
- Example: red hair
- If the genotype is ‘aa’ the individual has red hair
- All other genotypes: the individual does not have red hair.
Blang code for a pedigree (Q3/Family.bl): https://silico.io/abouchard/assignment/template/blang-workshop-nov-2019/q3/pedigrees/latest

Exercise:
- Create a Blang model for recessive traits
- Run the pedigree example and inspect the posterior distribution
- Change the model to predict the probability that Harry and Ginny’s child Albus has red hair
Solution
model Recessive {
  param IntVar numberOfVariantAlleles
  random IntVar traitPresent
  laws {
    traitPresent | numberOfVariantAlleles ~ Bernoulli(if (numberOfVariantAlleles == 2) 1.0 else 0.0)
  }
}
Note: deterministic relations such as the one above cause problems in other PPLs such as JAGS/WinBUGS; Blang handles these using a novel tempering strategy (annealing both the temperature and the support simultaneously).
Advanced example: distributions as parameters
See Q4/CensoredExchangeableCount.bl:
- this model takes an integer valued distribution as input,
param IntDistribution pmf
- it models the distribution over histograms produced from the pmf as follows:
- sample a Poisson random variable \(N \sim \text{Poisson}(\text{poissonRate})\)
- generate \(N\) iid integers according to pmf
- censor the zeros
- this is part of a larger hierarchical model we are using (collab. with BC Cancer Agency) to model CRISPR-Cas9 data with “bar codes”.
Let us demonstrate how we can do Bayesian model selection using this model.
- The quantity \(Z\) encountered in the first example corresponds to the probability or density of the observations
- Comparing values of \(Z\) for different models is the default Bayesian model selection method
First, notice I used a different inference engine: in configuration.txt:
--engine SCM
--engine.nParticles 20
Warning: more particles should be used in practice
This means instead of using adaptive non-reversible parallel tempering (PT, the default), we use an adaptive Sequential Change of Measure algorithm (better at computing normalization constant \(Z\)).
The estimate for \(Z\) is in results/latest/logNormEstimate.csv and is also shown in the console (Log normalization constant estimate: -4668.584992824664)
Exercise:
- replace the Poisson PMF by a richer family
- compare the normalization constant to recommend a model selection
Solution
model Crispr {
  random CountFrequencies frequencies ?: CountFrequencies::parse("83x1;70x1;66x1;55x1;51x1;48x1;47x1;44x1;42x1;39x1;37x1;31x2;30x2;27x1;26x2;25x5;24x3;23x4;22x2;21x2;20x3;18x2;17x2;16x4;15x5;14x7;13x4;12x9;11x11;10x12;9x12;8x10;7x15;6x16;5x31;4x39;3x54;2x109;1x725")
  random RealVar
    r ?: latentReal,
    p ?: latentReal,
    lambda ?: latentReal
  laws {
    p ~ Exponential(0.1)
    r ~ Exponential(0.1)
    lambda ~ Exponential(0.1)
    frequencies | lambda, r, p
      ~ CensoredExchangeableCounts(
          NegativeBinomial::distribution(r, p),
          lambda
        )
  }
}
Gives logZ = -2281.094793017699, so this is a MUCH better model!
Note: no need for parameter counting!
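The comparison boils down to a difference of log normalization constants; a trivial Python sketch using the two SCM estimates reported in these slides:

```python
# SCM estimates of the log normalization constants, from the slides
log_Z_poisson = -4668.584992824664   # Poisson emission model
log_Z_negbin = -2281.094793017699    # negative binomial emission model

# log Bayes factor in favour of the negative binomial model
log_bayes_factor = log_Z_negbin - log_Z_poisson
```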
Reproducibility note: running code twice will produce the same result even in multi-threading mode (random seeds can be changed with --engine.random 1234).
Hierarchical example: getting tidy data into Blang and computing Bayes factors
Motivation:
- Say we want to predict a rocket success/fail for a type we do not have much data for
- E.g. “shiny rocket” slide from first lecture: Falcon Heavy, 1 success, 1 trial
- We will build on the first example
- When data is sparse, posterior is sensitive to specification of the prior
- Extreme example: for a maiden flight, what will the posterior look like?
Key idea: use “side data” to inform the prior
- For example: success/fail launch data from other types of rockets
- Can we use the full rocket launch data provided to inform prediction for a single rocket type of interest?
data <- read.csv("failure_counts.csv")
data %>%
head() %>%
knitr::kable(floating.environment="sidewaystable")
| rocket type | numberOfLaunches | numberOfFailures |
|---|---|---|
| Aerobee | 1 | 0 |
| Angara A5 | 1 | 0 |
| Antares 110 | 2 | 0 |
| Antares 120 | 2 | 0 |
| Antares 130 | 1 | 1 |
| Antares 230 | 1 | 0 |
How to (badly) use side data
- Let’s make the number of coins in the bucket “go to infinity”, i.e. switch to a continuous prior over the success/failure probability
\[\begin{align}
p &\sim \text{Beta}(1, 1) \\
X_n | p &\sim \text{Bernoulli}(p); n \in \{1, 2, 3\}
\end{align}\]
- Let us also rewrite the likelihood using a binomial distribution:
nHeads | nTrials, p ~ Binomial(nTrials, p)
- Finally, let us give each rocket type its own success probability: see Q5/Hierarchical.bl

Practical problem: how to get observations into Blang? Let’s look at the comments in Q5/Hierarchical.bl…
However is this a useful model in terms of improving predictions for the type of rocket of interest?
By the way, to restrict to only Ariane 1: I used
...
for (Index<String> rocketType : rocketTypes.indices.filter[key == "Ariane 1"]) {... }
...

Towards an improved use of side data
- Background: “mean-pseudo-sample-size” reparameterization of the Beta distribution
- A reparameterization is a different labelling of a family such that you can go back and forth between the two labellings
- Consider \[\alpha = \mu s, \;\; \beta = (1 - \mu) s\] where \(\mu \in (0, 1)\), \(s > 0\).
- Interpretation:
- \(\mu\): mean of the Beta
- \(s\): measure of “peakiness” of the density, higher \(s\) corresponds to more peaked; roughly, \(s \sim\) number of data points that would make the posterior peaked like that.
- Why did we do this reparameterization?
- Now it should be more intuitive how we could go about using side data to inform at least \(\mu\)
- Ideas?
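The round trip between the two labellings can be written out explicitly; a minimal Python sketch (the function names are mine):

```python
def to_alpha_beta(mu, s):
    """(mean, pseudo-sample-size) -> standard Beta(alpha, beta) parameters."""
    return mu * s, (1 - mu) * s

def to_mu_s(alpha, beta):
    """Inverse map: the Beta mean is alpha / (alpha + beta),
    and the pseudo-sample-size is alpha + beta."""
    return alpha / (alpha + beta), alpha + beta

alpha, beta = to_alpha_beta(0.25, 8.0)   # -> (2.0, 6.0)
```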
An improved use of side data (not quite fully Bayesian yet!)
- Estimate \(p_i\) for each type of rocket
- Fit a distribution
- Use this distribution as the prior on \(\mu\)
- Weakness? Hint: what are the bumps at 1/2 and 1?
- Also: less clear how to do this with the pseudo-sample-size \(s\)
counts <- read.csv("failure_counts.csv")
ggplot(counts, aes(x = numberOfFailures / numberOfLaunches)) +
geom_histogram()

Solution: go fully Bayesian
Recall:
- Construct a probability model including
- random variables for what we will measure/observe
- random variables for the unknown quantities
- those we are interested in (“parameters”, “predictions”)
- others that just help us formulate the problem (“nuisance”, “random effects”).
- Compute the posterior distribution conditionally on the actual data at hand
- Use the posterior distribution to make prediction, estimate uncertainty, make a decision, etc

In our case: just make \(\mu\) and \(s\) random! (or equivalently, \(\alpha\) and \(\beta\))
- Share these two “global parameters” across all launch types \[p_i | \mu, s \sim {\text{Beta}}(\mu s, (1 - \mu) s)\]
- We still need to put a prior on \(\mu\) and \(s\)
- But you should expect this prior choice to be less sensitive. Why? Hint: look at the number of outgoing edges in the graphical model. Compare with number of outgoing edges for the “maiden flight” (see next slide)
- Example: \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\), \(s \sim \text{Exponential}(1/10000)\) (why such a small value? Hint: mean vs. rate parameter)
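On the hint: assuming the rate parameterization used in these slides, an Exponential with rate \(1/10000\) has mean \(10000\), i.e. a weakly informative prior that still allows large pseudo-sample sizes \(s\). A quick Python sanity check:

```python
import random

rng = random.Random(0)
rate = 1 / 10000
prior_mean = 1 / rate   # an Exponential(rate) distribution has mean 1 / rate

# empirical check via Monte Carlo draws from the same distribution
draws = [rng.expovariate(rate) for _ in range(100_000)]
empirical_mean = sum(draws) / len(draws)
```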
Exercise
In Q5:
- Modify Hierarchical.bl to make the model hierarchical
- Check if the shape of the posterior distribution changes
Solution
model Hierarchical {
  param GlobalDataSource data
  param Plate<String> rocketTypes
  param Plated<IntVar> numberOfLaunches
  random Plated<RealVar> failureProbabilities
  random Plated<IntVar> numberOfFailures
  random RealVar m ?: latentReal, s ?: latentReal
  laws {
    m ~ ContinuousUniform(0, 1)
    s ~ Exponential(0.1)
    for (Index<String> rocketType : rocketTypes.indices) {
      failureProbabilities.get(rocketType) | m, s ~ Beta(m*s, (1.0-m)*s)
      numberOfFailures.get(rocketType)
        | RealVar failureProbability = failureProbabilities.get(rocketType),
          IntVar numberOfLaunch = numberOfLaunches.get(rocketType)
        ~ Binomial(numberOfLaunch, failureProbability)
    }
  }
}
Summary when there is no sharing:
data <- read.csv("figs/failureProbabilities-summary-one.csv")
data %>%
head() %>%
knitr::kable(floating.environment="sidewaystable")
| | rocket type | mean | SD | 2.5% | median | 97.5% |
|---|---|---|---|---|---|---|
| 1 | Ariane 1 | 0.2322357 | 0.1168141 | 0.0150489 | 0.2152880 | 0.6411600 |
| 2 | Ariane 2 | 0.2572539 | 0.1460565 | 0.0055072 | 0.2395823 | 0.7926601 |
| 3 | Ariane 3 | 0.1558292 | 0.0965917 | 0.0044491 | 0.1351953 | 0.6279660 |
| 4 | Ariane 40 | 0.1078332 | 0.0980382 | 0.0001226 | 0.0793672 | 0.5976650 |
| 5 | Ariane 42L | 0.0621791 | 0.0560808 | 0.0001023 | 0.0439521 | 0.3481125 |
| 6 | Ariane 42P | 0.1181305 | 0.0771131 | 0.0021443 | 0.1029950 | 0.4222125 |
Summary with hierarchical sharing:
data <- read.csv("figs/failureProbabilities-summary-hier.csv")
data %>%
head() %>%
knitr::kable(floating.environment="sidewaystable")
| | rocket type | mean | SD | 2.5% | median | 97.5% |
|---|---|---|---|---|---|---|
| 1 | Ariane 1 | 0.1169992 | 0.0692933 | 0.0064677 | 0.1036391 | 0.4533803 |
| 2 | Ariane 2 | 0.0998560 | 0.0749805 | 0.0003683 | 0.0816964 | 0.4872174 |
| 3 | Ariane 3 | 0.0799940 | 0.0560602 | 0.0018218 | 0.0661795 | 0.4222133 |
| 4 | Ariane 40 | 0.0471762 | 0.0471762 | 0.0000000 | 0.0344085 | 0.3723142 |
| 5 | Ariane 42L | 0.0371144 | 0.0380083 | 0.0000001 | 0.0263615 | 0.3403815 |
| 6 | Ariane 42P | 0.0644173 | 0.0438645 | 0.0010904 | 0.0549336 | 0.2955533 |

New higher-level hyperparameters = new problems? No, we are probably OK!
It seems we have introduced new problems as now we again have hyperparameters, namely those for the priors on \(\mu\) and \(s\). Here we picked \(\mu \sim {\text{Beta}}(1,1) = {\text{Unif}}(0, 1)\), \(s \sim \text{Exponential}(1/10000)\)
Key point: yes, but now we are less sensitive to these choices!
Why? Heuristic: say you have a random variable connected to some hyper-parameters (grey squares) and random variables connected to data (circles)
- If most of the connections are hyper-parameters: will probably be sensitive
- If there are many more connections to random variables compared to hyper-parameters: will probably be insensitive
Before going hierarchical: for maiden/early flights we had

After going hierarchical:

Example: permutations and user defined data types
- Permutation: a bijection \(\{1, 2, \dots, n\} \to \{1, 2, \dots, n\}\)
- Statistical applications in record linkage
Unidentifiable example: how Blang works under the hood
- Computing posterior distributions can be very challenging
- Mental picture most people have for hard problems:

- This is only part of the story
- What is missing:
- often in practice, hard posterior distributions are due to unidentifiability
- this creates ridges on which HMC is either slow or requires a high Hessian compute cost per step

- Plan
- Introduce an unidentifiable posterior inference problem
- Compare Blang and Stan’s inference engines on that problem
Unidentifiability
When there are too many parameters, you may not be able to recover them even if you had “infinite data”
Modifications to the coin/rocket example
- Let’s make the number of coins in the bucket “go to infinity”, i.e. switch to a continuous prior over the success/failure probability
\[\begin{align}
p &\sim {\text{Unif}}(0, 1) \\
X_n | p &\sim \text{Bernoulli}(p); n \in \{1, 2, 3\}
\end{align}\]
- Let us also rewrite the likelihood using a binomial distribution:
nHeads | nTrials, p ~ Binomial(nTrials, p)
- Finally, let us modify the prior for the rocket example (continuous version), setting \(p = p_1 p_2\) where \(p_1 \sim {\text{Unif}}(0,1)\) and \(p_2 \sim {\text{Unif}}(0,1)\).
Intuition: there are pairs of \((p_1, p_2)\) that yield the same joint density, e.g. \((p_1, p_2) = (0.1, 0.2)\) and \((p_1, p_2) = (0.05, 0.4)\).
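This intuition can be verified directly: the data only enter through the product \(p_1 p_2\), so the two pairs give the same likelihood. A quick Python check (the observed counts here are arbitrary):

```python
import math

def binomial_log_likelihood(n_heads, n_trials, p):
    """Log of the Binomial(n_trials, p) PMF evaluated at n_heads."""
    return (math.log(math.comb(n_trials, n_heads))
            + n_heads * math.log(p)
            + (n_trials - n_heads) * math.log(1 - p))

# two distinct parameter pairs with the same product p = p1 * p2 = 0.02
l1 = binomial_log_likelihood(1, 10, 0.1 * 0.2)
l2 = binomial_log_likelihood(1, 10, 0.05 * 0.4)
```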
Model code in the Blang PPL
model Unidentifiable {
  ...
  laws {
    p1 ~ ContinuousUniform(0.0, 1.0)
    p2 ~ ContinuousUniform(0.0, 1.0)
    nHeads | nTrials, p1, p2 ~ Binomial(nTrials, p1 * p2)
  }
}

Very brief overview of Blang’s posterior inference engine
Annealing

Combination of two ideas: “annealing” and “parallel chains”
- First idea: annealing
- suppose a density has height \(h\) at a point
- consider \(h^{0.0001}\), note that this \(\approx 1\), for both \(h=10\) and \(h=0.1\)
- taking a small power of a positive number brings it closer to \(1\)
- let’s do that with every point in the landscape we seek to explore!
\[\pi_{\beta}(x) = (\pi(x))^{\beta}\]
where \(\beta\) controls an interpolation between:
- the uniform distribution (\(\beta = 0\), easy to sample from)
- the posterior distribution (\(\beta = 1\), hard)
Let us call \(\beta\) the annealing parameter
Now we have several different “landscapes”, from smooth to rough. We will explore all of them simultaneously!
Swaps:
- Let’s pick a sequence of annealing parameters, say \((\beta_0, \dots, \beta_N) = (0, 1/N, 2/N, \dots, 1)\)
- Second idea: parallel tempering
- Run several MCMC chains in parallel (each say based on slice sampling) for each landscape \(\pi_i(x) = (\pi(x))^{\beta_i}\) including the flat one (\(i=0\) where \(\beta = 0\)) and the landscape of interest (\(i=N\) where \(\beta = 1\))
- This ensemble of chains provides us with samples from the high dimensional landscape \(\pi(x_0, x_1, \dots, x_N) = \prod_i \pi_i(x_i)\)
- At the end, only keep the states in \(x_N\), i.e. those corresponding to \(\beta = 1\) (the posterior)
- But use the other ones to propose swaps between pairs of chains to allow “jumps” across the state space
- Swaps are accepted or rejected according to the Metropolis rule from earlier today (optional exercise: can you derive the acceptance probability?)
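A minimal Python sketch of (reversible) parallel tempering on a toy bimodal target — an illustration of the two ideas above, not Blang's actual engine (which uses non-reversible swaps and slice sampling); the target, schedule, and step sizes are arbitrary choices:

```python
import math
import random

rng = random.Random(1)

def log_pi(x):
    """Log-density (up to a constant) of a toy bimodal target:
    mixture of Gaussians at -5 and +5, computed stably."""
    a, b = -0.5 * (x - 5.0) ** 2, -0.5 * (x + 5.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

betas = [i / 4 for i in range(5)]   # annealing schedule beta_0 = 0, ..., beta_N = 1
chains = [0.0] * len(betas)         # one state per landscape pi(x)^beta_i

def mh_step(x, beta):
    """One random-walk Metropolis step targeting pi(x)^beta."""
    y = x + rng.gauss(0.0, 1.0)
    return y if math.log(rng.random()) < beta * (log_pi(y) - log_pi(x)) else x

samples = []
for _ in range(2000):
    chains = [mh_step(x, b) for x, b in zip(chains, betas)]
    # swap kernel: propose exchanging neighbouring chains, accept by Metropolis;
    # log acceptance ratio: (beta_{i+1} - beta_i) * (log pi(x_i) - log pi(x_{i+1}))
    for i in range(len(betas) - 1):
        delta = (betas[i + 1] - betas[i]) * (log_pi(chains[i]) - log_pi(chains[i + 1]))
        if math.log(rng.random()) < delta:
            chains[i], chains[i + 1] = chains[i + 1], chains[i]
    samples.append(chains[-1])      # keep only the beta = 1 (posterior) chain
```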

Non-reversible parallel tempering
- You may need many parallel chains to be able to swap
- Naive parallel tempering then gets really slow
- Ongoing work in my research group: non-reversible parallel tempering avoids this issue
Parallel Tempering: Reversible (top) versus non-reversible (bottom)

- More details in my talk tomorrow